On the overestimation of random forest’s out-of-bag error
Abstract
Background: The ensemble method random forests has become a popular classification tool in bioinformatics and related fields. The out-of-bag error is an error estimation technique that is often used to evaluate the accuracy of a random forest and to select appropriate values for tuning parameters, such as the number of candidate predictors randomly drawn for a split, referred to as mtry. However, for binary classification problems with metric predictors, the out-of-bag error has been shown to overestimate the true prediction error. Moreover, the overestimation was shown to depend on the parameter mtry. It is therefore questionable whether the out-of-bag error can be used in classification tasks to select tuning parameters like mtry. Based on simulated and real data, this paper aims to identify settings in which the overestimation is likely.

Results: The simulation-based and real-data-based studies with metric predictor variables show that the overestimation is largest in balanced settings and in settings with few observations, a large number of predictor variables, small correlations between predictors, and weak effects. The overestimation had hardly any impact on tuning parameter selection.

Conclusions: Although the prediction performance of random forests was not substantially affected when the out-of-bag error was used for tuning parameter selection in the present studies, one cannot be sure that this applies to all future data. For settings with metric predictor variables, it is therefore recommended to always use stratified subsampling for both tuning parameter selection and error estimation in random forests. This yielded less biased estimates of the true prediction error.
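The contrast between bootstrap-based and stratified-subsampling-based out-of-bag estimates can be illustrated with a toy simulation. This is only a sketch under strong assumptions that are not taken from the abstract: the predictors are assumed to carry no signal at all, so each "tree" is reduced to predicting the majority class of its in-bag sample, and the names `oob_error`, `bootstrap`, and `stratified_subsample` are hypothetical helpers. The idea it illustrates is that when an observation is out-of-bag for a bootstrap sample, its class tends to be slightly under-represented in-bag, which pushes the out-of-bag vote against it; drawing a fixed number of observations per class removes that imbalance.

```python
import random


def oob_error(labels, draw_in_bag, n_trees, rng):
    """OOB error of a toy 'forest' whose trees predict their in-bag majority class.

    Assumption (not from the paper): predictors carry no signal, so a tree can
    do no better than the majority class of its in-bag sample (ties at random).
    """
    n = len(labels)
    votes = [[0, 0] for _ in range(n)]  # OOB votes per observation: [class 0, class 1]
    for _ in range(n_trees):
        in_bag = draw_in_bag(labels, rng)
        ones = sum(labels[i] for i in in_bag)
        zeros = len(in_bag) - ones
        if ones == zeros:
            pred = rng.randrange(2)  # tie: predict a random class
        else:
            pred = 1 if ones > zeros else 0
        in_bag_set = set(in_bag)
        for i in range(n):
            if i not in in_bag_set:  # observation i is out-of-bag for this tree
                votes[i][pred] += 1
    errors, counted = 0, 0
    for i, (v0, v1) in enumerate(votes):
        if v0 + v1 == 0:
            continue  # never out-of-bag, so no OOB prediction exists
        if v0 == v1:
            pred = rng.randrange(2)
        else:
            pred = 1 if v1 > v0 else 0
        errors += int(pred != labels[i])
        counted += 1
    return errors / counted


def bootstrap(labels, rng):
    """Draw n indices with replacement (the classical in-bag sample)."""
    n = len(labels)
    return [rng.randrange(n) for _ in range(n)]


def stratified_subsample(labels, rng, fraction=0.632):
    """Draw the same fraction of each class without replacement.

    The fraction 0.632 is an illustrative choice, not a value from the abstract.
    """
    in_bag = []
    for cls in (0, 1):
        members = [i for i, y in enumerate(labels) if y == cls]
        k = max(1, round(fraction * len(members)))
        in_bag.extend(rng.sample(members, k))
    return in_bag


rng = random.Random(0)
labels = [0] * 10 + [1] * 10  # small, balanced, binary problem
repeats = 5

err_boot = sum(oob_error(labels, bootstrap, 300, rng) for _ in range(repeats)) / repeats
err_strat = sum(oob_error(labels, stratified_subsample, 300, rng) for _ in range(repeats)) / repeats

print(f"OOB error, bootstrap:              {err_boot:.2f}")
print(f"OOB error, stratified subsampling: {err_strat:.2f}")
```

By construction the true error in this toy setting is 0.5, since the labels cannot be predicted. The bootstrap-based estimate should land well above 0.5, while the stratified estimate should stay close to it. The setting is deliberately extreme (balanced classes, few observations, no effects), which matches the direction of the conditions the abstract names as most prone to overestimation; real forests with informative predictors would show a much smaller gap.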
Similar resources
Variable Selection in Random Forest with Application to Quantitative Structure-Activity Relationship
A wrapper variable selection procedure is proposed for use with learning machines that generate a measure of variable importance, such as Random Forest. The procedure is based on iteratively removing low-ranking variables and assessing the learning machine performance by cross-validation. The procedure is implemented for Random Forest on some QSAR modeling examples from drug discovery and devel...
Cost-Complexity Pruning of Random Forests
Random forests perform bootstrap aggregation by sampling the training samples with replacement. This enables the evaluation of the out-of-bag error, which serves as an internal cross-validation mechanism. Our motivation lies in using the unsampled training samples to improve each decision tree in the ensemble. We study the effect of using the out-of-bag samples to improve the generalization error first...
Root Attribute Behavior within a Random Forest
Random Forest is a computationally efficient technique that can operate quickly over large datasets. It has been used in many recent research projects and real-world applications in diverse domains. However, the associated literature provides little information about what happens in the trees within a Random Forest. The research reported here analyzes the frequency with which an attribute appears in the...
Weighted Decisions in a Fuzzy Random Forest
A multi-classifier system obtained by combining several individual classifiers usually exhibits a better performance (precision) than any of the original classifiers. In this work we use a multi-classifier based on a forest of randomly generated fuzzy decision trees (Fuzzy Random Forest), and we propose a new method to combine their decisions to obtain the final decision of the forest. The prop...
The Number of Patients and Events Required to Limit the Risk of Overestimation of Intervention Effects in Meta-Analysis: A Simulation Study
Background: Meta-analyses including a limited number of patients and events are prone to yield overestimated intervention effect estimates. While many assume bias is the cause of overestimation, theoretical considerations suggest that random error may be an equal or more frequent cause. The independent impact of random error on meta-analyzed intervention effects has not previously been explored....
Publication date: 2017